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While studies show that autism is highly heritable, the nature of 
the genetic basis of this disorder remains illusive. Based on the idea 
that highly correlated genes are functionally interrelated and more 
likely to affect risk, we develop a novel statistical tool to find more po¬ 
tentially autism risk genes by combining the genetic association scores 
with gene co-expression in specific brain regions and periods of de¬ 
velopment. The gene dependence network is estimated using a novel 
partial neighborhood selection (PNS) algorithm, where node specific 
properties are incorporated into network estimation for improved sta¬ 
tistical and computational efficiency. Then we adopt a hidden Markov 
random field (HMRF) model to combine the estimated network and 
the genetic association scores in a systematic manner. The proposed 
modeling framework can be naturally extended to incorporate ad¬ 
ditional structural information concerning the dependence between 
genes. Using currently available genetic association data from whole 
exome sequencing studies and brain gene expression levels, the pro¬ 
posed algorithm successfully identified 333 genes that plausibly affect 
autism risk. 

1. Introduction. Autism spectrum disorder (ASD), a neurodevelopmen- 
tal disorder, is characterized by impaired social interaction and restricted, 
repetitive behavior. Genetic variation is known to play a large role in risk 
for ASD [Gaugler et al. (2014), Kiel et al. (2012)], and yet efforts to identify 
inherited genetic variation contributing to risk have been remarkably un¬ 
successful [Anney et al. (2012), Liu et al. (2013)]. One explanation for this 
lack of success is the large number of genes that appear to confer risk for 
ASD [Buxbaum et al. (2012)]. Recent studies estimate this number to be 
near 1000 [Sanders et al. (2012), He et al. (2013)]. 
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The advent of next generation sequencing and affordable whole exome 
sequencing (WES) has led to significant breakthroughs in ASD risk gene 
discovery. Most notable is the ability to detect rare de novo mutations (i.e., 
new mutations) in affected individuals. These studies examine ASD trios, 
defined as an affected child with unaffected parents, to determine rare mu¬ 
tations present in the affected child, but not in the parents. A fraction of 
these mutations cause loss of function (LoF) in the gene. And when these 
rare damaging events are observed in a particular gene for multiple ASD 
trios, it lends strong evidence of causality [Sanders et al. (2012)]. While this 
approach has revolutionized the field, the accumulation of results is slow, rel¬ 
ative to the size of the task: to date, analysis of more than two thousand ASD 
trios has identified less than two dozen genes clearly involved in ASD risk 
[lossifov et al. (2012), Kong et al. (2012), Neale et al. (2012), O’Roak et al. 
(2011, 2012), Sanders et al. (2012), Willsey et al. (2013), De Rubeis et al. 
(2014)]. Extrapolating from these data suggest that tens of thousands of 
families would be required to identify even half of the risk genes [Buxbaum 
et al. (2012)]. At the same time, a single de novo LoF (dnLoF) event has 
been recorded for more than 200 genes in the available data. Probability 
arguments suggest that a sizable fraction of these single-hit genes are ASD 
genes [Sanders et al. (2012)], indicating that genetic data are already pro¬ 
viding partial information about more ASD genes. Thus, there is an urgent 
need to advance ASD gene discovery through the integration of additional 
biological data and more powerful statistical tests. 

The large number of genes with rare coding mutations identified by ex¬ 
ome sequencing presents an opportunity for the next wave of discoveries. 
Fortunately a key element in the path forward has recently been identified. 
ASD-related mutations have been shown to cluster meaningfully in a gene 
network derived from gene expression in the developing brain—specifically 
during the mid-fetal period in the frontal cortex [Willsey et al. (2013)]. These 
results support the hypothesis that genes underlying ASD risk can be or¬ 
ganized into a much smaller set of underlying subnetworks [Ben-David and 
Shifman (2012), Willsey et al. (2013), Parikshak et al. (2013)]. This leads to 
the conjecture that networks derived from gene expression can be utilized 
to discover risk genes. Liu et al. (2014) developed DAWNq, a statistical 
algorithm for “Detecting Association With Networks” that uses a hidden 
Markov random field (HMRF) model to discover clusters of risk genes in 
the gene network. Here we present DAWN, a greatly improved approach 
that provides a flexible and powerful statistical method for network assisted 
risk gene discovery. 

There are two main challenges to discovery of ASD genes: (1) weak genetic 
signals for association are spread out over a large set of genes; and (2) these 
signals are clustered in gene networks, but the networks are very high dimen¬ 
sional. Available data for network estimation are extremely limited, hence 
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the dimension of the problem is orders of magnitude greater than the sample 
size. A weakness of DAWNq lies in the approach to gene network construc¬ 
tion. The algorithm is based on discovering gene modules and estimating the 
edges connecting genes within a module based on the pairwise correlations. 
In contrast, DAWN estimates the conditional independence network of the 
genes under investigation. It achieves this goal utilizing a novel network esti¬ 
mation method that achieves a dimension reduction that is tightly linked to 
the genetic data. Our approach to network assisted estimation is based on 
three key conjectures: (i) autism risk nodes are more likely to be connected 
than nonrisk nodes; (ii) by focusing our network reconstruction efforts on 
portions of the graph that include risk nodes we can improve the chance 
that the key edges in the network that connect risk nodes are successfully 
identified and that fewer false edges are included; and (hi) the HMRF model 
will have greater power to detect true risk nodes when the network estima¬ 
tion procedure focuses on successfully reconstructing partial neighborhoods 
in the vicinity of risk nodes. 

The remainder of this paper is organized as follows. Section 2 presents 
data and background information. Section 3 presents the main idea of our 
testing procedure within a graphical model framework. First, we develop 
an algorithm for estimating the gene interaction network that integrates 
node-specific information. Second, we describe the HMRF model. Third, 
we extend our model to include the directed network information. Last, 
we develop in theory to motivate why our network estimation procedure 
is more precise when node-specific information is integrated. In Section 4 
simulation experiments compare our approach with other network estima¬ 
tion algorithms. In Section 5 we apply our procedure to the latest available 
autism data. 

2. Background and data. 

2.1. Genetic signal. DAWN requires evidence for genetic association for 
each gene in the network. While this can be derived from any gene-based test 
for association, a natural choice is TADA, the Transmission And De novo 
Association test [He et al. (2013)]. For this investigation, TADA scores were 
calculated using WES data from seventeen distinct sample sets consisting of 
16,098 DNA samples and 3871 ASD cases [De Rubeis et al. (2014)]. Using a 
gene-based likelihood model, TADA produces a test statistic for each gene 
in the genome. Based on these data, 18 genes incurred at least two dnLoF 
mutations and 256 incurred exactly once. Any gene with more than one 
dnLoF mutation is considered a “high confidence” ASD gene and those 
with exactly one are classified as a “probable” ASD genes due to the near 
certainty (>99%) and relatively high probability (>30%) the gene is a risk 
gene, respectively [Willsey et al. (2013)]. Based on TADA analysis of all 
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genes covered by WES, 33 genes have false discovery rate (FDR) q-values 
<10% and 107 have g-values <30%. Thus, in total, this is a rich source of 
genetic data from which to make additional discoveries of ASD risk genes 
and subnetworks of risk genes. 

2.2. Gene networks. The major source of data from which to infer the 
gene-gene interaction network is gene expression levels in specific tissues, 
which are obtained by high throughput microarray techniques. Using the 
BrainSpan transcriptome data set [Kang et al. (2011)], Willsey et al. (2013) 
examined the coexpression patterns across space and time of genes with 
at least one dnLoF mutation. The data originate from 16 regions of the 
human brain sampled in 57 postmortem brains ranging from 6 weeks post¬ 
conception to 82 years of age. By identifying the region and developmental 
period of the brain in which ASD genes tend to cluster, their investigation 
confirmed that gene expression networks are meaningful for organization 
and inter-relationships of ASD genes. Specifically, they identified prefrontal 
and motor-somatosensory neocortex (FC) during the mid-fetal period as the 
most relevant spatial/temporal choice. While each brain is measured at only 
one point in time, combining gene expression from the five frontal cortex re¬ 
gions with the primary somatosensory cortex, multiple observations can be 
obtained per sample. Nevertheless, the sample size was very small: for exam¬ 
ple, for fetal development spanning 10-19 weeks post-conception, 14 brains, 
constituting 140 total samples, were available from which to determine the 
gene network. 

Another type of network is the gene regulation network, which is a di¬ 
rected network. By studying the ChIP-chip data or the ChIP-seq data, one 
can obtain which genes are regulated by particular transcription factors 
(TFs). Many available gene regulation networks have already been studied 
and integrated into a large database called ChEA [Lachmann et al. (2010)]. 
But this kind of network is far from complete. Here we incorporate the TF 
network for a single gene (FMRP) to illustrate how this type of information 
might be utilized in the hunt for ASD risk genes. 

2.3. Network estimation. To estimate the gene co-expression network by 
expression levels, in general, there are three types of approaches. The most 
straightforward way is to apply a correlation threshold: the connectivity of 
two genes is determined by whether the absolute correlation is larger than 
a fixed threshold. This is the approach taken in the popular systems biol¬ 
ogy software tool known as Weighted Gene Co-expression Network Anal¬ 
ysis (WGCNA) [Langfelder and Horvath (2008)]. This tool is frequently 
used to discover networks of genes, or modules, with high coexpression. The 
DAWNa algorithm used this principle to construct a gene correlation net¬ 
work [Liu et al. (2014)]. Using WGCNA, modules were formed based on 


NETWORK ASSISTED ANALYSIS OF AUTISM RISK 


5 


the dendrogram with the goal of partitioning genes into highly connected 
subunits. Next, to generate a relatively sparse network within each mod¬ 
ule, genes with very high correlation were clustered together into multi-gene 
supernodes. The motivation for pre-clustering highly correlated genes as su¬ 
pernodes was to create a network that is not dominated by local subsets of 
highly connected genes. By grouping these subsets of genes into supernodes, 
the broader pattern of network connections was more apparent. Finally, the 
gene network was constructed by connecting supernodes using a correlation 
threshold. 

A major innovation of the DAWN algorithm developed in this paper is a 
more efficient network estimation method with better statistical interpreta¬ 
tion. Constructing a network based on correlations has two advantages: it 
is computationally efficient and the edges can be estimated reliably using a 
small sample. In contrast, the conditional independence network is sparser 
and has greater interpretability, but it is much harder to estimate. Assum¬ 
ing that the gene expression levels follow a multivariate normal distribution, 
the conditional independence can be recovered by estimating the support of 
the inverse covariance matrix of the expression data. One approach is to 
estimate the inverse covariance matrix directly using penalized maximum 
likelihood approaches [Friedman, Hastie and Tibshirani (2008), Cai, Liu 
and Luo (2011), Cai, Liu and Zhou (2012), Ma, Xue and Zou (2013)]. Alter¬ 
natively, the neighborhood selection method is based on sparse regression 
techniques to select the pairs of genes with nonzero partial correlations. For 
instance, Meinshausen and Biihlmann (2006) applied LASSO for the neigh¬ 
borhood selection of each gene and then construct the adjacency matrix 
by aggregating the nonzero partial correlation obtained for each regression. 
Peng, Zhou and Zhu (2009) proposed a joint sparse regression method for 
estimating the inverse covariance matrix. A challenge for both the neighbor 
selection method and the maximum likelihood approach is that the number 
of expression samples available is two orders of magnitude smaller than the 
number of genes. In most applications that utilize LASSO-based methods 
this challenge is diminished by simply estimating the gene network for sev¬ 
eral hundred genes. For example. Tan et ah (2014) use a sample of size 400 
to estimate a gene network for 500 genes. For this application we wish to 
explore the full range of genes that might be involved in risk for autism, 
and thus we cannot reduce the dimension in a naive manner. One may also 
consider inverting an estimated covariance matrix [Schafer and Strimmer 
(2005), Opgen-Rhein and Strimmer (2007)]. But in high dimensions the ma¬ 
trix inversion may be too noisy. DAWN takes a novel approach to dimension 
reduction to optimize the chance of retaining genes of interest. 

2.4. Networks and feature selection. Many previous papers have dis¬ 
cussed how to incorporate the estimated network into the feature selection 


6 


L. LIU, J. LEI AND K. ROEDER 


problems, namely, DAPPLE [Rossin et al. (2011)], GRAIL [Raychaudhuri 
et al. (2009)1 metaRanker [Pers et al. (2013)], Hotnet [Vandin, Upfal and 
Raphael (2011)], VEGAS [Liu et al. (2010)] and penalized methods [Mairal 
and Yu (2013)]. However, none of these methods control the rate of false 
discovery. 

Motivated by the work of Li, Wei and Maris (2010) and Wei and Pan 
(2008), DAWNq, applied a HMRF model to integrate the gene network into 
a powerful risk gene detection procedure. In principle, this approach captures 
the stochastic dependence structure of both TADA genetic scores and the 
gene-gene interactions, while being able to provide posterior probability of 
risk association for each gene and thus control the rate of false discovery. In 
practice, DAWN„ has a weakness due to the multi-gene nodes that define 
the networks. This complication led to several statistical challenges in the 
implementation of the algorithm. Notably, a post-hoc analysis is required to 
determine which gene(s) within a multi-gene node are associated with the 
phenotype. With DAWN we can capture the strengths of the natural pairing 
of the gene network and the HMRF model without these added challenges. 

3. Methods. The TADA scores together with the gene-gene interaction 
network provide a rich source of information from which to discover ASD 
genes. To obtain useful information from these data sets, we will need to 
utilize existing tools and also to develop novel statistical procedures that 
can overcome several challenges. Our model incorporates 3 main features: 
(1) Based on DAWN^, a HMRF model combines the network structure and 
individual TADA scores in a systematic manner that facilitates statistical 
inference. (2) To obtain the most power from this model, we require a sparse 
estimate of the gene-gene interaction network, but the sample size is insuffi¬ 
cient to yield a reliable estimate of the full gene network (approximately 100 
observations and 20,000 genes). However, based on the form of the HMRF 
model, it is apparent that it is sufficient to estimate the sub-network of the 
gene-gene interaction network that is particularly relevant to autism risk. 
We provide a novel approach to achieving this goal. (3) Finally, the statis¬ 
tical model efficiently incorporates additional covariates, for instance, the 
targets of key transcription factors that may regulate the gene network, to 
predict autism risk genes. 

Feature two is the most challenging. Under the high-dimensional setting, 
the existing network estimation approaches are neither efficient nor accu¬ 
rate enough to successfully estimate the network. To optimize information 
available in a small sample size, we need to target our efforts to capture 
the dependent structure between disease-associated genes and their nearest 
neighbors. DAWN uses a novel partial neighborhood selection (PNS) ap¬ 
proach to attain this goal. By incorporating node-specific information, this 
approach focuses on estimating edges between likely risk genes so that it 
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reduces the complexity of the large-scale network estimation problem and 
provides a disease-specific network for the HRMF procedure. 

To incorporate the estimated network into the risk gene detection proce¬ 
dure, feature one involves simplifying the HMRF model already developed 
for DAWNq, to integrate the estimated network and the genetic data. By 
applying the proposed model, the posterior probability of each gene being 
a risk gene can be obtained based on both the genetic evidence and neigh¬ 
borhood information from the estimated gene network. 

If additional gene dependence information such as targets of transcription 
factor networks are available, they can be incorporated naturally into the risk 
gene detection procedure so that better power can be achieved. To this end, 
for feature three, we extend the Ising model by adding another parameter 
to characterize the effect of such additional dependence information. This 
allows simple estimation and inference using essentially the same procedure. 

3.1. Partial neighborhood selection for network estimation. To estimate 
a high-dimensional disease-specific gene network with small sample size 
data, we propose the partial neighborhood selection (PNS) method. Let 
Xi,...,A„ be the samples from d-dimensional Gaussian random variables 
with covariance matrix S. Our goal at this stage is to estimate the support 
of the inverse matrix of S, which is an adjacency matrix fl. To maximize the 
power of the follow-up HMRF algorithm, the estimated adjacency matrix 
should be as precise as possible. But, given the high dimensionality and the 
small sample size, estimating the support of the entire precision matrix is 
a very ambitious goal. To overcome this challenge, it has been noted that 
ignoring some components of the high-dimensional parameter will lead to 
better estimation accuracy [Levina and Bickel (2004)]. Here we follow this 
rationale by estimating entries 0,{i,j) for a set of selected entries {i,j), and 
setting j) = 0 for other entries. Such a selective estimation approach will 
inevitably cause some bias, as many components of the parameter of interest 
are assigned a null value. However, this approach has the potential to greatly 
reduce the estimation variance for the selected components, as the reduced 
estimation problem has much lower dimensionality. Such a procedure is par¬ 
ticularly useful in situations where some low-dimensional components of the 
parameter are more important for subsequent inference. We will need to 
choose the zero entries carefully so that the bias is controlled. Because our 
ultimate goal is to detect the risk genes associated with a particular dis¬ 
ease, the dependence structure between risk genes is more essential in the 
procedure rather than the dependence between nonrisk genes. Specifically, 
we target for genes i and j with higher TADA scores and their high 

correlation neighbors. Such a choice can be supported by the HMRF model 
described in Section 3.2 as well as the theoretical results in Section 3.4. 
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Algorithm 1 PNS algorithm 

1. p-value screening: Exclude any nodes with p-value pi > t. The remaining 
nodes define 5" = {i : p* < t}. 

2. Correlation screening: Construct a graph G' = where is an 

adjacency matrix with OC = I{\pij \ > r}, where pij is the pairwise corre¬ 
lation between the ith and jth node. Then exclude all isolated nodes to 
obtain S = S' \{j-. 0.'^- = 0}. 

3. Retrieving neighbors: Retrieve all possible first order neighbors of nodes 
in S and obtain node set V, where E = S' U {j : \pij \ > T,i G S'}. 

4. Constructing graph: Apply Meinshausen and Biihlmann’s (2006) 
regression-based approach to select the edges among nodes in S and 
between nodes in S and E/S' by minimizing the following di individual 
loss functions separately: 


(3.1) 



Xi- 








where di is the number of nodes in S', A is the regularization parameter. 
Then the graph of E is constructed as Hjj = 1 — (1 — Eij){l — Eji), 
where matrix E is 

^ r/{|A,|>0}, ieS, 

1 0 , i^S. 

5. Return G = (E, 11). 


In the PNS algorithm (Algorithm 1), the p-values for each gene are uti¬ 
lized as the node-specific information for the network estimation. In step 1, 
we start with the key genes, S', defined as those genes with relatively small 
TADA p-values. In step 2, we further screen the key genes by excluding any 
elements that are not substantially co-expressed with any other measured 
genes. This step is taken because the upcoming HMRF model is applied 
to networks. Genes that are not highly co-expressed with any other genes 
are not truly functioning in the network. The resulting set, S, establishes 
the core of the network. In the third step we expand the gene set to E by 
retrieving all likely neighbors of genes in the set S. The likely partial correla¬ 
tion neighbors of gene j G S are identified based on the absolute correlation 
\Pij \ > T- The superset E includes all likely risk genes and their neighbors, 
but excludes all portions of the gene network that are free of genetic signals 
for risk based on the TADA scores. Similar correlation thresholding ideas 
have been considered in Butte and Kohane (1999), Yip and Horvath (2007), 
Luo et al. (2007). Then we apply the neighborhood selection method [Mein¬ 
shausen and Biihlmann (2006)] for each gene in the set S to decide which 
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genes are the true neighbors of risk genes. Note that the estimated graph 
does not contain possible edges between nodes in 1/ \ 5, but the edges that 
link nodes in 1/ \ S' will not affect the results of our follow-up algorithm, 
so it is much more efficient to not estimate those edges when we estimate 
the disease-specific network. In the fourth step we apply the neighborhood 
selection algorithm to the subnetwork V. 

Setting threshold values in gene screening. The PNS algorithm uses two 
tuning parameters, t and r, in the screening stage. The choice of t and r 
shall lead to a good reduction in the number of genes entering the network 
reconstruction step, while keeping most of the important genes. A practical 
way of choosing t would be to match some prior subject knowledge about 
the proportion of risk genes. In general, t shall not be too small in order to 
avoid substantial loss of important genes. The choice of r is more flexible, 
depending on the size of the problem and available computational resource. 
In our autism data the number of genes is very large, therefore, a relatively 
large value of r is necessary. The choice of r = 0.7 has been used for gene 
correlation thresholding in the literature [see, e.g.. Yip and Horvath (2007), 
Luo et al. (2007), Willsey et al. (2013)]. In our simulation study, we find 
that the performance of PNS is stable as long as t is not overly small, and 
is insensitive to the choice of r. More details are given in Section 4.2. 

Choosing the tuning parameter in sparse regression. Finding the right 
amount of regularization in sparse support recovery remains an open and 
challenging problem. Meinshausen and Biihlmann (2010) and Liu, Roeder 
and Wasserman (2010) proposed a stability approach to select the tuning 
parameter; however, due to the high-dimension-low-sample-size scenario, the 
subsampling used in this approach reduces the number of samples to an un¬ 
desirable level. Li et al. (2011) proposed selecting the tuning parameter by 
controlling the FDR, but the FDR cannot be easily estimated in this context. 
Lederer and Muller (2014a) suggested an alternative tuning-free variable se¬ 
lection procedure for high-dimensional problems known as TREX. Graphical 
TREX (GTREX) extends this approach to graphical models [Lederer and 
Muller (2014b)]. Although this approach produced promising results in sim¬ 
ulated data, it relies on subsampling. Gonsequently, for some data sets the 
sample size will be a limiting factor. 

A parametric alternative relies on an assumption that the network follows 
a power law, that is, the probability a node connects to k other nodes is equal 
to p{k) ~ k~'^. This assumption is often made for gene expression networks 
[Zhang and Horvath (2005)]. To measure how well a network conforms to 
this law, assess the square of correlation E? between \ogp{k) and log(A:): 

(3.2) = {corr{\ogp{k),\og{k))f. 

R? = 1 indicates that the estimated network follows the power law perfectly, 
hence, the larger the R?, the closer the estimated network is to achieve the 


10 


L. LIU, J. LEI AND K. ROEDER 


scale-free criteria. In practice, the tuning parameter, A, can be chosen by 
visualizing the scatter plot of B? as a function of A. There is no guarantee 
that the power law is applicable to a given network [Khanin and Wit (2006)], 
and this approach will not perform well if the assumption is violated. As 
applied in the PNS algorithm, the assumption is that the select set of genes 
in V follow the power law. The PNS subnetwork is not randomly sampled 
from the full network, as it integrates the p-value and the expression data 
to select portions of the network rather than random nodes. It has been 
noted in the literature [Stumpf, Wiuf and May (2005)] that the scale-free 
property of degree distribution of a random subnetwork may deviate from 
that of the original full network; however, the deviation is usually small. We 
find the scale-free criterion suitable for the autism data sets considered in 
this paper. However, the general performance of PNS and DAWN does not 
crucially depend on this assumption, as we demonstrate in the simulation 
study in Section 4.2. 

3.2. Hidden Markov random field model. Gene-based tests such as TADA 
reveal very few genes with a p-value that passes the threshold for genome¬ 
wide significance. However, after taking the gene interaction network into 
consideration, we usually find that some genes with small p-values are clus¬ 
tered. The p-values of those genes are usually not significant individually, 
but this clustering of small p-values in the network is highly unlikely to 
happen by chance. To enhance the power to detect risk genes, we adopt a 
HMRF model to find risk genes by discovering genes that are clustered with 
other likely risk genes. 

First we convert the p-values to normal Z-scores, Z = (Zi;...;Z„), to 
obtain a measure of the evidence of disease association for each gene. These 
Z-scores are assumed to have a Gaussian mixture distribution, where the 
mixture membership of Zj is determined by the hidden state /*, which in¬ 
dicates whether or not gene i is a risk gene. We assume that each of the 
Z-scores under the null hypothesis (/ = 0) has a normal distribution with 
mean 0 and variance Uq, while under the alternative (/ = 1) the Z-scores 
approximately follow a shifted normal distribution, with a mean p and vari¬ 
ance a\. Further, we assume that the Z-scores are conditionally independent 
given the hidden indicators I = (R,..., /„). The model can be expressed as 

(3.3) Zi ~ P{h = 0)iV(0, al) + P{h = I)iV(p, af). 

The dependence structure reduces to the dependence of hidden states fi. 
To model the dependence structure of fi, we consider a simple Ising model 
with probability mass function 

P(I = 1 ]) oc exp{b^r] + crf^rf) for all rj G {0,1}”. 


(3.4) 
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Algorithm 2 HMRF parameter estimation 

1. Initialize the states of node /* = 1 if Zj > -^thres and 0 otherwise. 

2. For t = 1,... ,T 

(a) Update (6^*^, ) by maximizing the pseudo likelihood 

_expjfe/t + cliilj.l} _ 

exp{6Ii + c/jOj.!} + exp{6(l - li) + c(l - Ii)Qi.l}' 

(b) Apply a single cycle of the iterative conditional mode [ICM, Besag 
(1986)] algorithm to update I. Specifically, we obtain a new based 
on 

(c) Update ) to 

^ E,m = l|Z,6M;cW)Z, 

E*m = l|Z,6W;cW) ’ 

„2(u ^ E^m = o|z,6W;cW)^f 

° E.m = 0|Z,6W;cW) ’ 

.2it) ^ = 

E.m = l|Z,6W;cW) 

3. Return (b,c, fi,aQ,af) = 


We apply the iterative algorithm (Algorithm 2) to estimate the parame¬ 
ters and the posterior probability of P(/i|Z). 

After the posterior probability of P(/j|Z,I_j) is obtained, we apply Gibbs 
sampling to estimate the posterior probability qi = P{Ii = OjZ). Finally, let 
( 7 (j) be the sorted posterior probability in ascending order; the Bayesian FDR 
correction [Muller, Parmigiani and Rice (2006)] of the Rh sorted gene can 
be calculated as 

i 

(3.5) FDR; = ^ 

Genes with FDR less than a are selected as the risk genes. 

In summary, the DAWN algorithm (Algorithm 3) consists of four steps. 
The HMRF component of DAWN^ is similar in spirit to what is described 
here for DAWN, but the implementation in the former algorithm is consid¬ 
erably less powerful due to multi-gene nodes. DAWN^ cannot directly infer 
risk status of genes from the estimated status of the node. 
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Algorithm 3 DAWN algorithm 

1. Obtain gene specific p-values. 

2. Estimate the gene network using the PNS algorithm (Algorithm 1). 

3. Incorporate the information from steps 1 and 2 into the HMRF model 
and estimate the parameters of the HMRF model (Algorithm 2). 

4. Apply the Bayesian FDR correction to determine the risk genes [equation 
(3.5)]. 


3.3. Extending the Ising model. Our framework is general and flexible 
enough to incorporate additional biological information such as the TF net¬ 
work information by naturally extending the Ising model. Under this ex¬ 
tended model, we can incorporate a directed network such as the TF network 
along with the undirected network such as the gene co-expression network. 
From the microarray gene expression levels, an undirected network could be 
estimated based on the PNS algorithm. With the TF network information, 
we could also estimate a directed network that indicates which genes are reg¬ 
ulated by specific TF genes. This additional information can be naturally 
modeled in the Ising model framework by allowing the model parameter to 
be shifted for particular collection of TF binding sites. The density function 
of this more general Ising model is as follows: 

(3.6) P(I = r]) (X exp(6l'T/ -|- crfQ.ri + dH'rj), 

where H = {hi ,..., hn) is the indicator of TF binding sites, and d> 0 reflects 
the enhanced probability of risk for genes regulated by TF. 

If d > 0, this indicates that the TF binding site covariate is a predictor 
of risk for diseases. To test whether or not d is significantly larger than 
zero, we compare the observed statistic d with d obtained under the null 
hypothesis of no association. To this end, we adopt a smoothed bootstrap 
procedure which simulates data with the same clustering of the observed 
genetic signals, but without an association with the TF binding site. 

To simulate Z from the null model, we first simulate the hidden states 
I from the distribution (3.4). We randomly assign initial values of I to 
each node in the network and the proportion of nodes with 1 = 1 is r, 
where r G (0,1) is a pre-chosen value, for example, 0.1. Then, we apply a 
Metropolis-Hastings algorithm to update I until convergence. The full boot¬ 
strap procedure is described in Algorithm 4. 

For presentation simplicity we describe the idea of incorporating addi¬ 
tional subject knowledge into the Ising model for a single TF. The procedure 
can be straightforwardly extended to incorporate multiple TFs. In this case, 
the Ising model for the hidden vector I becomes 

/ ^ 

P(I = rj) (X exp j bl'rj + cq'ilr] + E dkH'kig 

\ k=l 





NETWORK ASSISTED ANALYSIS OE AUTISM RISK 


13 


Algorithm 4 

1. Apply the algorithm to model (3.4) to obtain estimates of the model 
parameters. 

2. Using the estimated null model, simulate I* by the Metropolis-Hastings 
algorithm, then simulate Z* using equation (3.3). 

3. Using model (3.6), estimate the parameters for the simulated data. 

4. Repeat steps 2-3 B times, the B copies of estimated d can be used as a 
reference distribution of the estimated parameter under the null model. 

5. Output the p-value P = ^ > d}. 


The bootstrap testing procedure described in Algorithm 4 also carries over 
in an obvious manner to the multiple TF case. 

3.4. More about partial neighborhood selection. In this section we discuss 
theory that explains why PNS can more precisely estimate edges between 
risk genes. We find that under the Ising model, nodes with similar properties 
are more likely to be connected with each other in the network. Therefore, 
by utilizing this property of the Ising model, we can greatly improve the 
accuracy of estimating a disease-specihc network. 

The following theorem suggests that the larger the Z-scores are for the 
two nodes, the more likely there is an edge connecting those two nodes. 
Therefore, it is reasonable to adapt the lasso regression to retrieve neighbors 
of only candidate risk genes, which are the genes that have small p-values. 
This choice is justihed because those genes are more likely to be connected 
with other genes. 

Theorem 3.1. Assume that (Z,I) are distributed according to the HMRF 
in equations (3.3) and (3.4). Assume that U has independent entries. Let 
U'= (^ 1 ) ^ 2 ) / (u j)}- be the set of all possible Ll'. Define R and 

Ij as the ith and jth element of I, V = {Ii,l 2 , ■ ■ ■ ,dd)/{Ii,Ij}, and the set 
of all possible 1 ' . Then for any Ll' ^ s/ and any 1 ' € 3d, P{Llij = 1|Z,U',T) 
is an increasing function of Zi and Zj. 

Theorem 3.1 provides some justihcation for the p-value thresholding in 
the PNS algorithm. An important condition here is that I is distributed as 
an Ising model where the conditional independence is modeled by the binary 
matrix U. In practice, if U is estimated from some other data source, then it 
is possible that Q may not be relevant to reflect the independence structure 
of I. This is not the case in our application, as the gene co-expression data is 
collected from the BrainSpan data for the frontal cortex sampled during the 
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mid-fetal developmental period because it has been shown that this space¬ 
time-tissue combination is particularly relevant to autism [Willsey et al. 
(2013)]. 


Proof of Theorem 3.1. Let k = l represent = (1,1), k = 2 

represent = (1,0), k = 3 represent = (0,1), and k = A represent 

= (0,0). Then 

p{n^j = i|z,ii',r) 

4 

= = 1, (/*,/i) = A:|Z,Li',lO 

k=l 

4 

= = i\{ii,ij) = k,z,n',r)p{{ii,ij) = k\z,n',i') 

k=l 

4 

= ^P(Qij = l\(Ii,I,) = k,Q',l')P((Ii,Ij) = k\Z,VL',l') 

k=l 


4 

= Ml (T, VL')Pi +Y,M2 (!', VL')Pk , 

k=2 


whereMi(I',0') =P(a,i = = k,n',V) and Pk = Pi{Ii, Ij) = k\Z,n',V 

Taking a derivative of P{^ij = 1|Z,11',T) with respect to Zj, we have 

aP(fl, ^ l|Z,fl^ ^ ^ m ^ ^ fdP2 


dZi 


dZi 




dZi dZi dZi 


= Mi(T,L!') X ^ - M2(I',11') X 

f) P 

= (Mi(T,11')-M2(I',11'))x^. 


Based on Lemma 3.1, we have Mi(I', fl') — M 2 (I^ 11') > 0. Based on Lemma 3.2, 
we have > 0. Thus, we obtain d ^ P{VLij = 1|Z, 11', I') 

is an increasing function of Zj. Similarly, we obtain that P{^ij = 1|Z,11',I') 
is also an increasing function of Zj. □ 


The theorem above reveals the specific structure of the adjacency matrix 
for the network in the Ising model setting. This kind of adjacency matrix 
has more edges in the block of risk genes and fewer edges in the block of 
nonrisk genes. Thus, given this specific structure and limited sample size, 
it is reasonable to focus on estimating the edges between genes with small 
p-values. Therefore, under the Ising model, the proposed PNS algorithm is 
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a more precise network estimation procedure than other existing network 
estimating procedures, which all ignore the node-specific information. 


Lemma 3.1. Under the same eonditions as in Theorem 3.1, for any 
O' G j?/ and any I' G 3^, 


= O') 


Mi(I', O') > = 1), if h = Ij = 1, 

M 2 (I', o') < P{Tlij = 1), otherwise. 


Proof. 

P(0i, = l|/i = /, = l,I',0') 

P(O,,=0|/i = /, = l,I',O') 

P(0,, = l,/, = Jj = l,I',0') 

P(Oy =0,/i = /, = l,I',O') 

_ P{Ii = Ij = l,I'|Oi,- = l,0')P(0y = l)P(O') 
P{Ii = Ij = l,I'|Oij =0,O')P(Oij =0)P(O')’ 

Let 

/(/i,/,-,l'|0y,0')=exp(-6'l + cl‘0I), 
where = I^I?, = = Ij- We define 

Ti (!', Hij , O') = / (I, = 1, /, = 1,1' 10,,-, O'), 
r2(i',0i,-,0') = /(/, = 1,/, = 0,I'|0,,-,0'), 

r3(i',Oi,-,o') = /(/, = 0,/, = i,i'|o,,-,o'), 
r4(i', Oi,-, o') = /(/, = 0,= 0, i'|o,,-, o'). 


Then we obtain that 

P(/i = /, = l,l'|Oy = 1,0') 

P(/, = /, = l,T|Oi,=0,O') 


ri(i',o,j = i,o') 

ri(i',o,, = o,o') 


It is easy to show that 

Ti(T,Oy = 1) = ri(T,Ojj =0) X exp{2c} 


and 


7fc(I^ Ojj = 0) = Tfc(l', Ojj = 1) 


/c = 2,3,4. 
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Therefore, 

p(/i = /, = i,r|ffij=o,oo 

^_ EJ^s^ELl^fc(J^^u = o.^O _ 

Ej'g^{EL 2 Tk{i'^ = 0> ^') + n,, = 0, n') x exp{2c}} 

Ti(I', i^ij = 0, O') X exp{2c} 

"" Ti(r,Oy = 0,0') 

^_ Ej'6i^exp{2c}X]Li^fc(J',Ou = 0,00 _ 

Ej'g^{Efe=2^fc(J0^U = 0,0') +ri(J',Oij = 0,O') x exp{2c}} 
>1 if c > 0. 


Thus, we obtain 


P(Oij = l\Ii = Ij = 1,1', O') ^ P(0,,- = 1) 
P(0y = o\ii = I, = 1,1', O') ^ P{n,j = 0) ’ 


which leads to 


P(Oij = l\Ii = Ij = 1,1', O') = Mi(l',0') > 


Pj^ij = 1) 

P{Qjij = 0 ) + P{flij = 1 ) 


Similarly, for any O' G and any I' G we obtain 


P((/„J,) = fc,I'|0,, = l,0') 

P( (/*,/,) = fc, I'I Oi,= 0,0') 

__ Ej'g^ Ez=i ^ij — 0, ^') _ 

~ Ej'g^{Et 2 rKJ',a,= 0 ,O') + ri(J',Oi,=0,O') xexp{2c}} 
<1 when A: = 2,3,4, 


where k = 2 means = (1,0), k = 3 means = (0,1), and fc = 4 

means = (0,0). Then it is easy to obtain 

P(Oi,' = l\(Ii,P) = A;,I',0') = Mfc(T,0') < —- ^ 

^ ^ ^ fev , ; p(Oij = 0) + P(Oij = 1) 

From equation (3.7) it is clear that Mfc(I',0') does not depend on k, thus 
Mfc(T,0') = M2(I',0') for A: = 2,3,4. □ 


Based on Lemma 3.1, we know that if a pair of nodes has two risk nodes, 
then this pair of nodes are more likely to be connected with an edge than 
the pairs of nodes with only one risk node or no risk nodes. 
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Lemma 3.2. Under the same eonditions as in Theorem 3.1, for any 
Ut! ^ si and any V G llS, P{Ii = l,Ij = 1|L,0',Z) is an inereasing funetion 
of Zi and Zj. 

Proof. We first derive the conditional probability of U = Ij = 1 given 
I', O' and Z: 

Pi = P(/i = 1, = 1|I', O', Z) = P{Ii = 1, 1, = 1,1', O', Z)/P(I', O', Z) 

= P(Z|/i = 1, 1, = 1,1', 0')P(/i = 1, Ij = 1,1', 0')/P(l', O', Z) 

= P{Zi\Ii = l)P{Zj\I, = l)P(Z_i,_,|l')P(/i = 1,7, = 1,1', O') 
/P(I',0',Z) 

= P{Zi\Ii = l)P{Zj\I, = l)P(Z_i,_,|l')P(/, = 1,7, = 1,I'|0')P(0') 
/P(I',0',Z) 

= P{Zi\Ii = l)P{Zj\I, = l)P(Z_i,_,|l')P(7, = 1,7, = 1,I'|0') 
/P(I',Z|0'). 

Similarly, we obtain 

P 2 = P(7, = 1,7, =0|l',O',Z) 

= P{Zi\h = l)P{Zj\Ij = 0)P(Z_i,_,|l')P(7, = 1,7,- = 0,l'|O') 
/P(I',Z|0'), 

P 3 = P(7i = 0,7, = l|l',O',Z) 

= P{Zi\h = 0)P(Z,|7,- = l)P(Z_i,_,|l')P(7, = 0,7,- = 1,I'|0') 
/P(I',Z|0'), 

P 4 = P(7i = 0,7, =0|l',O',Z) 

= P{Z^\h = 0)P(Z,|7,- = 0)P(Z_i,_,|l')P(7, = 0,7,- = 0,l'|O') 
/P(I',Z|0'). 

We further dehne 

Cl = P(Z_i,_,|l')P(7, = 1,7, = 1,I'|0'), 

C 2 = P(Z_i,_,|l')P(7, = 1,7, = 0,l'|O'), 

Ca = P(Z_i,_,|l')P(7, = 0,7, = 1,I'|0'), 

(O 4 = P(Z_i,_,|I')P(7, = 0,7, = 0,l'|O'). 

Since 

P(7i,7,-,I'|0') = P(I|0)P(0i,- = 1) +P(I|0)P(0i,- =0), 
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it is then clear that Ck, A: = 1,..., 4 is independent with Zi, Zj. Since 



therefore PijP 2 is an increasing function of Zj and independent with Zj. 
Similarly, we obtain P 1 /P 3 is an increasing function of Zj and independent 
with Zj, and Pi/Pi is an increasing function of Zj and Zj. Since 


Pi 


1 


-Pi + -P2 + -P3 + -P4 1 + P2/P1 + P3/Pi + P4/Pi 


thus Pi is an increasing function of Zj and Zj. □ 

Lemma 3.2 suggests that a larger value of Z indicates a larger probability 
of being a risk node. The probability of being a risk node is an increasing 
function of the Z score, given the risk status of other nodes are fixed. 

4. Simulation. In this section we use simulated data to evaluate our pro¬ 
posed models and algorithms and demonstrate the efficacy of our proposed 
method. We simulate Z-scores and hidden states from the HMRF model as 
given in (3.3) and (3.4). The gene expression levels are simulated from a 
multivariate Gaussian distribution. First, we compare the proposed PNS al¬ 
gorithm with other existing high-dimensional graph estimation algorithms. 
Second, we compare the power to detect the risk genes using graphs esti¬ 
mated using a variety of graph estimation algorithms. Our objective is to 
determine if we can achieve better risk gene detection when we incorporate 
the network estimated by PNS into the HMRF risk gene detection proce¬ 
dure. This comparison also sheds light on the advantages of DAWN relative 
to DAWNo. 

4.1. Data generation. We adopt the B-A algorithm [Barabasi and Albert 
(1999)] to simulate a scale-free network G= (F, D), where V represents the 
list of nodes and the adjacency matrix of the network is denoted as 0. 
To obtain a positive definite precision matrix supported on the simulated 
network D, the smallest eigenvalue e of vD is first computed, where v is 
a chosen positive constant. We then set the precision matrix to be vD + 
(jej + u)ldxd, where Idxd is the identity matrix, d is the number of nodes, 
and u is another chosen positive number. Two constants v and u are set 
as 0.9 and 0.1 in our simulation. Finally, by inverting the precision matrix, 
we obtain the covariance matrix B. Gene expression levels, Xi,... ,Xn, are 
generated independently from A^(0, B). The sample size n is equal to 180 in 
our simulation. 

To simulate Z from (3.3), we first simulate the hidden states I from the 
Ising model (3.4). Initial values of I are randomly assigned to each node in 
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No. of nodes =400 No. of nodes =800 


(a) (b) 

Fig. 1. Simulated scale-free network, (a) number of nodes equals 400 , (b) number of 
nodes equals 800. 

the simulated graph and we let half of the nodes have initial values /* = 1. 
Then, we apply the standard Metropolis-Hastings algorithm to update I 
with 200 iterations. The parameters in the Ising model (3.4) are set as 6 = —7 
and c = 3 in our simulation. 

Figure 1 shows the generated scale-free network with the hidden states 
simulated from the Ising model. The numbers of nodes d are set at 400 and 
800, respectively. In Figure 1(a) there are in total 68 nodes with Ii = l and 
in Figure 1(b) there are in total 82 nodes with I* = 1. After the network 
and the hidden states embedded in the network are obtained, we simulate 
2 ;-score Zi based on model (3.3) with /r = 1.5, uo = 1 and ui = 1. 

4.2. Estimation and evaluation. Using the simulated data Xi,... ,Xn, 
we first estimate the graph with the PNS algorithm. The p-value threshold 
t is chosen to be 0.1 and the correlation threshold r is set at 0.1. Define 
the important edges as those edges connecting risk nodes. To evaluate the 
performance of the PNS algorithm in retrieving important edges, we compare 
the following three graph estimation algorithms: 

• PNS: The proposed PNS algorithm. 

• Glasso: Graphical lasso algorithm. 

• Gorrelation: Gompute the pairwise correlation matrix M from Ai,..., A„, 
then estimate graph Djj = I{\Mij \ > r}. 

To compare the performance of graph estimation, the FDR is dehned as 
the proportion of false edges among all the called edges. Power is dehned as 
the proportion of true, important edges that are called among all the im¬ 
portant edges in the true graph. Figure 2 shows that under the same FDR, 
PNS retrieves many more important edges than the Glasso and Gorrelation 
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- 1 - 1 - 1 - 

PNS GLASSO Correlation 


No. of nodes =800 

(b) 


Fig. 2. Power of important edge detection. The FDR of the three approaches is set at 

0.5. 


algorithms. Calling more true edges between risk nodes will improve per¬ 
formance, but calling more false edges will reduce the power of the HMRF 
algorithm. From the comparison in Figure 2, we see that when calling the 
same number of false edges, the PNS algorithm calls more true important 
edges, which suggests that the HMRF model can achieve better power when 
using the network estimated by the PNS algorithm. We will examine this 
conjecture by comparing the power of the HRMF model using networks es¬ 
timated with different algorithms. The tuning parameters for each model 
are chosen to yield a preset FDR. It is worth noting that here the PNS 
algorithm does not use the scale-free criterion to choose the sparsity param¬ 
eter A. Thus, the good performance of PNS does not really depend on the 
scale-free assumption. 

To evaluate the power of network assisted risk gene detection, we apply 
the HMRF model using an estimated graph D and the simulated ^;-score Z. 
We compare the following four approaches: 

• HMRF_PNS: Apply the HMRF algorithm by incorporating the graph es¬ 
timated by PNS. 

• HMRF_Glasso: Apply the HMRF algorithm by incorporating the graph 
estimated by Glasso. The tuning parameter of Glasso is chosen to make 
the estimated graph having the same number of edges with H, the network 
estimated by PNS. 

• HMRF .Oracle; Apply the HMRF algorithm by incorporating the true 
graph H. 

• Naive: Classify the nodes only based on the observed ^;-score Z. 
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Fig. 3. ROC curve, (a) number of nodes equals fOO, (b) number of nodes equals 800. 


Figure 3 shows the receiver operating characteristic (ROC) curve of the 
four approaches applied to a single data set. We see that by applying the 
HMRF model to incorporate the structural information via the PNS algo¬ 
rithm, the accuracy rate of classihcation can be largely improved. To eval¬ 
uate the robustness of the proposed algorithm, we repeat the simulation 
20 times and compare the true positive rates (TPR) obtained from each 
approach under the same false positive rate. From Table 1, we reach the 
same conclusion that HRMF_PNS performs much better than DAWN_a, 
HMRF_Glasso and the Naive method. 

This simulation experiment also yields insights into advantages of DAWN 
over DAWN„. A key difference between the algorithms is that DAWN^ 
utilizes an estimated correlation network, while DAWN relies on the PNS 
partial correlation network. Comparing the two approaches in Figures 2, 
3 and Table 1 reveals notable differences. It appears that the correlation 
network fails to capture a sizable portion of the correct edges of the graph. 
Consequently, the HMRF has a greater challenge discovering the clustered 


Table 1 

True positive rate comparison. The false positive rates are 
controlled at 0.1 



d = 400 

d = 800 

DAWN 

0.733 (0.02) 

0.732 (0.02) 

DAWN^ 

0.663 (0.02) 

0.612 (0.03) 

HMRF.Glasso 

0.670 (0.03) 

0.651 (0.01) 

HMRF .Oracle 

0.934 (0.02) 

0.917 (0.01) 

Naive 

0.585 (0.02) 

0.567 (0.01) 
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Table 2 

True positive rate comparison of DAWN under different parameters 



t = 0.06 

t = 0.08 

t = 0.1 

t = 0.12 

t = 0.14 

r = 0.05 
r = 0.1 

r = 0.15 

0.665 (0.01) 
0.666 (0.01) 
0.667 (0.01) 

0.709 (0.01) 
0.709 (0.01) 
0.708 (0.01) 

0.732 (0.02) 
0.732 (0.02) 
0.731 (0.02) 

0.717 (0.01) 
0.717 (0.01) 
0.717 (0.01) 

0.699 (0.01) 
0.698 (0.01) 
0.698 (0.01) 


signal. Overall, the simulations suggest that DAWN performs much better 
because it uses PNS to fit the graph. 

Next, we examine the robustness of our proposed DAWN under different 
tuning parameters. To generate Table 1, we chose t = 0.1 and r = 0.1. Now, 
we vary the tuning parameters t and r and reevaluate the performance 
of DAWN. For t we use five different values 0.06,0.08,0.1,0.12,0.14, and 
for T we use three different values 0.05,0.1,0.15. The comparison is made 
using the same 20 simulated data sets that were used to generate Table 1 
(node = 800). 

From Table 2 we see that the results of DAWN are not sensitive to the 
choice of r. The tuning parameter t does affect the performance of our algo¬ 
rithm. If t is too small, we will not have enough seed genes for constructing 
the network and too many pairs of key genes are missed. But as long as 
t is not too small, the performance of our algorithm is robust. Hence, it 
is reasonable to choose a t that is not too small because in the screen¬ 
ing stage we prefer over inclusion. Finally, comparing Tables 1 and 2, we see 
that for every combination of parameters, DAWN outperforms DAWNq, and 
HMRF_glasso. 

5. Analysis of autism data. Building on the ideas described in Section 2, 
Background and Data, we search for genes association with risk for autism. 
The gene expression data we use to estimate the network was produced and 
normalized by Kang et al. (2011). Willsey et al. (2013) identified the spa¬ 
tial/temporal choices crucial to neuron development and highly associated 
with autism. Networks were estimated from the FC during post-conception 
weeks 10-19 (early fetal) and 13-24 (mid fetal). Thus, we apply PNS to es¬ 
timate the gene network using brains in early FC and mid FC, respectively. 
For a given time period, all corresponding tissue samples were utilized. In 
the early FC period there are 140 observations and in the mid FC period 
there are 107 observations. To represent genetic association TADA p-values. 
Pi are obtained from De Rubeis et al. (2014) for each of the genes. 

The PNS algorithm is applied to early FC and mid FC separately. The 
p-value threshold t is chosen to be 0.1 and the correlation threshold r is set 
as 0.1. After the screening step, in early FC there are 6670 genes of which 
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Lambda 



(a) 


(b) 


Eig. 4. Scale-free topology criteria, (a) early FC (b) mid FC. 


834 genes have p-values less than 0.1, and in mid FC there are 7111 genes 
of which 897 genes have p-value less than 0.1. We define these genes with 
p-value less than 0.1 as key genes. To choose the tuning parameter A, we 
apply the scale-free criteria and plot the square of correlation E? [equation 
(3.2)] versus A in Figure 4. Based on the figure we select, A = 0.12 because 
it yields a reasonably high B? value in both periods. The full network of all 
analyzed genes in early FC contains 10,065 edges of which 1005 edges are 
between key genes, and the subnetwork of key genes is shown in Figure 5(a). 
The full network of all analyzed genes in mid FC contains 11,713 edges of 
which 1144 edges are between key genes, and the subnetwork of key genes 
is shown in Figure 5(b). 

After the networks are estimated, we assign z-scores to each node of the 
network and then apply the HMRF model to the network. The initial hidden 
states of genes are set as I{pi < 0.05}. We fix the hidden states of 8 known 
autism genes as 1. Those 8 known autism genes are ANK2, CHD8, CUL3, 
DYRKIA, GRIN2B, POGZ, SCN2A and TBRl based on Willsey et al. 
(2013). We then compute the Bayesian FDR value [Muller, Parmigiani and 
Rice (2006)] of each gene based on the posterior probability qi obtained from 
the HMRF algorithm. Under the FDR level of 0.1, we obtain 246 significant 
genes in early FC of which 114 have at least one identified dnLoF mutation. 
In mid FC we obtain 218 significant genes of which 115 have at least one 
dnLoF mutation. We combine the significant genes from those two periods 
and obtain in total 333 genes as our final risk gene list (Supplemental Table 1 
[Liu, Lei and Roeder (2015)]). Among them, 146 genes have at least one 
dnLoF mutation. Comparing to the number of genes discovered by TADA 
[De Rubeis et al. (2014)] where structural information of the genes was 
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(a) (b) 

Fig. 5. Result of HMRF^PNS algorithm for autism data, (a) early FC (b) mid FC. 

not incorporated, the power of risk gene detection has been substantially 
improved. The genes in the risk gene list are red in Figure 5. From the 
figure it is clear that those genes in the risk gene list are highly clustered in 
the network. 

In our risk gene list, in addition to the 8 known ASD genes, there are 10 
additional genes that have been identified as ASD genes [Betancur (2011)] 
(three syndromic: LI CAM, PTEN, STXBPl ; two with strong support from 
copy number and sequence studies: MBD5, SHANK2', and five with equivo¬ 
cal evidence: FOXGl, FOXPl, NRXNl, SCNlA, SYNC API). Fisher’s ex¬ 
act test shows significant enrichment for nominal ASD genes in our risk gene 
list (p-value = 2.9 x 10“®). 

Next we compare the performance of DAWNq, and DAWN on the autism 
data. Ranking the DAWN^ genes by FDR g-value, we retain the top 333 
genes for comparison. Autism risk genes are believed to be enriched for 
histone-modifier and chromatin-remodeling pathways [De Rubeis et al. (2014)] 
Comparing the DAWN„ and DAWN gene list with the 152 genes with 
histone-related domains, we find 9 of these designated genes are on the 
DAWNq, list (Fisher’s exact test p-value = 4.7 x 10“^) and 11 are on the 
DAWN list (p-value = 5.5 x 10“®). Thus, DAWN lends stronger support 
for the histone-hypothesis and, assuming the theory is correct, it suggests 
that DAWN provides greater biological insights, but this does not prove 
that DAWN is better at identifying autism risk genes. Using new data from 
lossifov et al. (2014), we can conduct a powerful validation experiment. Sum¬ 
marizing the findings from the 1643 additional trios sequenced in this study, 
we find 251 genes that have one or more additional dnLoF mutations. Based 
on previous studies of the distribution of dnLoF mutations, we know that 
a substantial fraction of these genes are likely autism genes [Sanders et al. 


NETWORK ASSISTED ANALYSIS OF AUTISM RISK 


25 


(2012)]. We find 18 and 24 of these genes are in the DAWNq, and DAWN 
lists, respectively. If we randomly select 333 genes from the full genome, on 
average, we expect to sample only 4-5 of the 251 genes. Thus, both lists 
are highly enriched with these probable autism genes (Fisher’s exact test 
p-value = 2.4 x 10“® and 3.4 x 10“^*^, resp.). From this comparison we con¬ 
clude that while both models are successful at identifying autism risk genes, 
DAWN is more powerful. 

We further investigate the robustness of our model to the lasso tuning 
parameter, A, by comparing the risk gene prediction set using two addi¬ 
tional choices bracketing our original selection. We identified 324, 333 and 
243 risk genes with FDR < 0.1 using A = 0.10, 0.12 and 0.15, respectively. 
Not surprisingly, the gene lists varied somewhat due to the strong depen¬ 
dence of the model on the estimated network; however, overlap between the 
first and second list was 281, and overlap between the second and third list 
was 197. The median TADA p-value for risk genes identified was approxi¬ 
mately 0.01 for each choice of A, suggesting the models were selecting genes 
of similar genetic information on average. But the model fitted with the 
strictest smoothing penalty (A = 0.15) identified a smaller number of genes, 
and yet it retained some genes with weaker TADA signals (95th percentile 
TADA p-value 0.3 versus 0.1 for the other smoothing values). This suggests 
that there might be greater harm in over-smoothing than under-smoothing. 
De Rubeis et al. (2014) identified 107 promising genes based on marginal ge¬ 
netics scores alone (TADA scores with FDR < 0.3), hence, we also examined 
consistency of the estimators over this smaller list of likely ASD risk genes. 
Of these genes, 12 of them do not have gene expression data at this period 
of brain development and cannot be included in our analysis, reducing our 
comparison for potential overlap to 95 genes. For the 3 levels of tuning pa¬ 
rameters, DAWN identified 82, 82 and 75 genes from this list, respectively. 
We conclude that although the total number of genes varies, the genes with 
the strongest signals are almost all captured by DAWN regardless of the 
tuning parameter chosen. Nevertheless, to obtain a more robust list of risk 
genes, it might be advisable to use the intersection of genes identified by a 
range of tuning parameters. 

The Ising model allows for the incorporation of numerous covariates such 
as TF binding sites either individually or en masse. To illustrate, we incorpo¬ 
rate the additional information from targets of FMRP [Darnell et al. (2011)]. 
These target genes have been shown to be associated with autism [lossifov 
et al. (2012)], hence, it is reasonable to conjecture that this covariate might 
improve the power of autism risk gene detection. Indeed, the additional term 
is significant in the Ising model (p < 0.005 obtained from Algorithm 4 and 
B is set as 200). Applying model (3.4) to the early FC period, we discov¬ 
ered 242 genes of which 118 have at least one dnLoF mutation. Four of the 
genes with one dnLoF mutation are newly discovered after we incorporate 
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Fig. 6. Risk genes identified after incorporating FMRP targets. 


the TF information. Those four genes are TRIP 12, RIMBP2, ZNF462 and 
ZNF238. Figure 6 shows the connectivity of risk genes after incorporating 
the TF information. 

6. Conclusion and discussion. In this paper we propose a novel frame¬ 
work for network assisted genetic association analysis. The contributions of 
this framework are as follows: first, the PNS algorithm utilizes the node 
specific information so that the accuracy of network estimation can be 
greatly improved; second, this framework provides a systematic approach 
for combining the estimated gene network and individual genetic scores; 
third, the framework can efficiently incorporate additional structural infor¬ 
mation concerning the dependence between genes, such as the targets of key 
TFs. 

A key insight arises in our comparison of the HMRF model using a va¬ 
riety of network estimation procedures. The Glasso approach tries to re¬ 
construct the whole network, while the PNS approach focuses on estimat¬ 
ing only the portions of the network that capture the dependence between 
disease-associated genes. It might seem counterintuitive that the HMRF 
model can achieve better power when the network is estimated by the PNS 
algorithm rather than by other existing high-dimensional network estimation 
approaches such as Glasso. Why would we gain better power when giving 
up much of the structural information? Results using the oracle show that 
HMRF works best when provided with the complete and accurate network 
(Table 1). The challenge in the high-dimensional setting is that it is infea- 
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sible to estimate the entire network successfully. Hence, the PNS strategy 
of focusing effort on the key portions of the network is superior. With this 
approach more key edges are estimated correctly relative to the number of 
false edges incorporated into the network. 

While we build on ideas developed in the DAWNq model [Liu et al. 
(2014)], the approach presented here extends and improves DAWN^ in sev¬ 
eral critical directions. In the original DAWNq model, the gene network was 
estimated from the adjacency matrix obtained by thresholding the correla¬ 
tion matrix. To obtain a sparse network, DAWN^ grouped tightly correlated 
genes together into multi-gene supernodes. DAWN uses PNS to obtain a 
sparse network directly without the need for supernodes. This focused net¬ 
work permits a number of improvements in DAWN. Because each node in 
the network produced by the PNS algorithm corresponds to a single gene, it 
is possible to directly apply the Bayesian FDR approach to determine risk 
genes. In contrast, the DAWNq required a second screening of genes based 
on p-values to determine risk genes after the HMRF step. Finally, DAWN 
is more flexible and allows for the incorporation of other covariates into the 
model. 

The proposed framework is feasible under different scenarios and has a 
wide application in various problems. In this paper, we extended the Ising 
model so that the proposed network assisted analysis framework can be 
applied to incorporate both the gene co-expression network and the gene 
regulation network. This framework can also be naturally extended to in¬ 
corporate the PPI network together with the gene co-expression network by 
simply adding another parameter in the Ising model. These three different 
types of networks can even be integrated simultaneously to maximize the 
power of risk gene detection. Moreover, the proposed risk gene discovery 
framework can be applied not only to ASD but also to many other complex 
disorders. 
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SUPPLEMENTARY MATERIAL 

Supplemental Table 1: Statistics for all genes analyzed in early and mid 
FC periods (DOL 10.1214/15-AOAS844SUPP; .zip). Column min_EDR is 
the minimum value of EDR of both periods. Eor the risk_early and risk_mid 
columns, a gene was labeled 1 if it was identified. EDR_early and EDR_mid 
column report the EDR value of each gene in early and mid EC periods. The 
dn.LoE column is the number of identified dnLoE mutations in each gene. 
The p-value column is the TADA p-value. 
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